In [1]:
import numpy as np
import pandas as pd
Understanding a language's data structures is essential for a good programming experience. A poor understanding of data structures leads to code that is inefficient and hard to read.
These notes walk through all of the Pandas data structures and also provide further reading for deeper knowledge. Prior knowledge of NumPy data structures is required.
The covered Pandas data structures are Series and DataFrame.
We can create Series from:
As an optional parameter, we can provide an index: a list of axis labels. Indexes provide an effective way to access data (slicing).
Each Series has a dtype that corresponds to the type of its values.
From dictionary:
In this case, if no index is provided, it is extracted from the dictionary keys, while the data is extracted from the values.
In [2]:
d = {'a':5.,'b':5.,'c':5.}
i = ['x','y','z']
s1 = pd.Series(d)
print(s1)
s1.index
Out[2]:
Otherwise, if an index is passed, the values whose keys appear in the index are pulled out; the remaining index labels are assigned NaN. A NaN value denotes missing data, and we will have to deal with these values later.
What is NaN? (from the Pandas docs) Some might quibble over our usage of missing. By “missing” we simply mean null or “not present for whatever reason”. Many data sets simply arrive with missing data, either because it exists and was not collected or it never existed. For example, in a collection of financial time series, some of the time series might start on different dates. Thus, values prior to the start date would generally be marked as missing.
In [3]:
d = {'a':5,'b':5,'c':5}
i = ['x','y','a','b']
s1 = pd.Series(d, index = i)
print(s1)
print(s1.dtype)
s1.index
Out[3]:
From list:
We can create Series from lists. In this case we can provide an index if desired; however, the index must have the same length as the list.
In [4]:
l = [5,5,5]
s1 = pd.Series(l)
print(s1)
i = ['x','y','z']
s1 = pd.Series(l, index = i)
print(s1)
#This would raise a ValueError (the index has more labels than the list has elements):
#i = ['x','y','a','b']
#s1 = pd.Series(l, index = i)
#print(s1)
print(s1.dtype)
s1.index
Out[4]:
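The commented-out error case above can be verified safely with a try/except; this is a small sketch assuming pandas raises a ValueError when the index length differs from the data length.

```python
import pandas as pd

l = [5, 5, 5]
try:
    # index has 4 labels but the list has 3 elements
    pd.Series(l, index=['x', 'y', 'a', 'b'])
except ValueError as e:
    print("Error handled:", e)
```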
From numpy array:
In [5]:
s2 = pd.Series(np.array([3,20,5]),index=['a','b','c'])
print(s2)
print(s2.dtype)
s2.index
Out[5]:
From scalar:
In [6]:
s3 = pd.Series(5,index=['a','b','c'])
print(s3)
print(s3.dtype)
s3.index
Out[6]:
A Series can have a name attribute. When dealing with DataFrames, each column Series' name is automatically set to its column label.
In [ ]:
s3 = pd.Series(5,index=['a','b','c'], name = 'Series3')
s3.name
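To illustrate the automatic naming just mentioned, a quick sketch (the DataFrame and the column label 'col1' are made up for the example): selecting a column from a DataFrame yields a Series whose name attribute is the column label.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3]})
s = df['col1']
print(s.name)  # prints col1: the Series inherits the column label
```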
A Series can be accessed by position (numerical index), by a boolean list, or by key (axis labels). Accessing by position is like working with NumPy ndarrays, while accessing by key (axis labels) is like working with dictionaries.
Position accessing
In [7]:
s2
Out[7]:
In [8]:
s2[1]
Out[8]:
In [10]:
s2[:1]
Out[10]:
Boolean list accessing
In [11]:
s2[[True,True,False]]
Out[11]:
In [14]:
s2[s2>4]
Out[14]:
Key accessing
In [15]:
s2[['a','b']]
Out[15]:
In [16]:
s2['a']
Out[16]:
In [17]:
'a' in s2
Out[17]:
When accessing a nonexistent key, a KeyError exception is thrown.
In [19]:
try:
s2['z']
except KeyError:
print("Error handled")
To avoid such errors, we can use the Series get method, which returns a default value when the key is missing.
In [20]:
s2.get('x', np.nan)
Out[20]:
Vectorized operations can be performed on Pandas Series, and Series are accepted by most NumPy functions. The result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one Series or the other, the corresponding result is marked as missing (NaN).
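A minimal sketch of the alignment rule just described (the two Series here are made up for the example): two Series with partially overlapping indexes produce a result over the union of their labels, with NaN wherever a label is missing in either operand.

```python
import pandas as pd

sa = pd.Series([1, 2], index=['a', 'b'])
sb = pd.Series([10, 20], index=['b', 'c'])
result = sa + sb
print(result)
# only 'b' exists in both operands, so 'a' and 'c' become NaN
```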
In [22]:
s2
Out[22]:
In [21]:
print(s2 + 23)
# we can apply np functions
np.add(s2,23) == s2 + 23
Out[21]:
In [23]:
s1
Out[23]:
In [24]:
s2
Out[24]:
Operations are performed index-wise.
In [28]:
s2.reset_index() + s1.reset_index()
Out[28]:
In [ ]:
(s2 + s1).dropna()
In [ ]:
s2 ** 3
In [ ]:
np.exp(s2)
In [34]:
pd.Series([1,2,3,4,5],dtype=np.float32)
Out[34]:
For numerical variables, the most common dtypes are int and float.
For categorical variables, strings are the common type. However, when working with labels, more advanced categorical data management can be used (http://pandas.pydata.org/pandas-docs/stable/categorical.html).
Why do strings have dtype object? (from http://stackoverflow.com/questions/21018654/strings-in-a-dataframe-but-dtype-is-object)
The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that is 8 bytes. But for strings, the length is not fixed. So instead of saving the bytes of the strings in the ndarray directly, Pandas uses an object ndarray, which saves pointers to the string objects; because of this, the dtype of such an ndarray is object.
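To see this pointer behaviour, a short sketch (assuming a classic pandas version, where string Series default to dtype object): the Series reports dtype object, and each element is an ordinary Python str.

```python
import pandas as pd

s = pd.Series(['a', 'bb', 'ccc'])
print(s.dtype)          # object: the underlying ndarray stores pointers
print(type(s.iloc[0]))  # each element is a plain Python str
```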
In [35]:
pd.Series(["a","b","c"],dtype=str)
Out[35]:
Are dates categorical or numeric?
Days of the month, months, days of the week, etc. are considered categorical.
Specific dates, such as the day payments are received, birth dates, etc. are numeric.
In [ ]:
pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
In [ ]:
import datetime
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(0, 7)]
date_s = pd.Series(date_list)
print(date_s)
date_s[1] > date_s[2]
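Following the categorical-data link above, days of the week can also be stored as an ordered Categorical so that comparisons respect weekday order rather than alphabetical order. This is a sketch using pd.Categorical; the sample values are made up for the example.

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday']
s = pd.Series(pd.Categorical(['Friday', 'Monday', 'Sunday'],
                             categories=days, ordered=True))
print(s.dtype)
print(s.min())  # Monday: ordered by weekday position, not alphabetically
```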
Missing Data
You can insert missing values by simply assigning to containers. The actual missing value used will be chosen based on the dtype.
For example, numeric containers will always use NaN regardless of the missing value type chosen:
In [36]:
import numpy as np
s = pd.Series(["a", "b", "c"])
s.loc[0] = None
s.loc[1] = np.nan
print(s)
s = pd.Series([1, 2, 3])
s.loc[0] = None
s.loc[1] = np.nan
print(s)
Missing values propagate naturally through arithmetic operations between pandas objects.
In [37]:
s1 = pd.Series([1,2,3])
s2 = pd.Series([1,np.nan,3])
s1 + s2
Out[37]:
In [38]:
s1.index
Out[38]:
In [39]:
s2.index
Out[39]:
The descriptive statistics and computational methods discussed in the data structure overview are all written to account for missing data.
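For example, a short sketch of this NaN-aware behaviour: sum and mean skip missing values by default (skipna=True), unless told otherwise.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.sum())              # 4.0: the NaN is skipped
print(s.mean())             # 2.0: averaged over the two present values
print(s.sum(skipna=False))  # nan: propagated when skipping is disabled
```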
Cleaning/filling missing data:
pandas objects are equipped with various data manipulation methods for dealing with missing data.
In [40]:
s2 = pd.Series([1,np.nan,3])
print(s2.fillna(0))
print(s2.dropna())
print(s2.isnull())
In [42]:
s2[s2.isnull()]
Out[42]:
Exercise 1: Load iqsize.csv using the csv library as a dictionary of Series.
Exercise 2: Check the Series dtypes. Are they correct? Why? What is the problem of having numerical variables as categorical?
Exercise 3: For the sex variable, select those that are males (including synonyms).
Exercise 4: Count how many women and how many men there are. How many missing elements are there?
In [ ]:
import csv
import urllib.request
url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat'
response = urllib.request.urlopen(url)
data = list(csv.reader(response.read().decode('utf-8').splitlines()))
#data is a list of lists
We can create a Series from a list of lists directly from the data variable.
In [ ]:
pd.Series(data)
This is not a very useful Series object, as access to the list items is not syntactically nice. However, let's try to put the countries of all airports in a Series.
In [ ]:
countries = pd.Series(np.array([airport[3] for airport in data]))
# this is a more interesting Series object
countries
In [ ]:
print(countries.index)
In [ ]:
## Access by number
print("-> Access by number")
print(countries[0])
## we can evaluate a function over all elements to get a boolean series
print((countries == "Spain").head())
print((countries[countries == "Spain"]).head())
print((countries[countries != "Spain"]).head())
print((countries[countries != "Spain"]).tail())
# we can get the list of unique countries
print(countries.unique())
In [ ]:
# we can also access by key
countries[81]
We can see that there's a mismatch between the Series index and the airport ID provided in the csv. Let's use what we know about Series to see what happened.
In [ ]:
csv_index = pd.Series(np.array([airport[0] for airport in data]))
print(csv_index[csv_index.astype(int) != csv_index.index + 1])
In [ ]:
# shift the airport IDs by one position (label -1 does not exist, hence NaN)
shifted_index = csv_index.reindex(csv_index.index - 1)
shifted_index = shifted_index.fillna(0)
shifted_index.index = csv_index.index
# we get the list of non-consecutive values
non_consecutive = csv_index[csv_index.astype(int) != shifted_index.astype(int) + 1].astype(int)
print(non_consecutive.head(10))
# check if our assumption is true (at most 1 consecutive value is missing)
print("the assumption is:", len(non_consecutive) + len(csv_index) == int(csv_index.iloc[-1]))
# let's see what happens
difference = (shifted_index.astype(int) + 1).add(-csv_index.astype(int))
print(difference[difference < -1])
We can create DataFrames from:
From dict of Series or dict
The result index will be the union of the indexes of the various Series. If there are any nested dicts, these will be first converted to Series. If no columns are passed, the columns will be the sorted list of dict keys.
In [43]:
d = {'one': pd.Series([1,2,3],index=['a','b','c']),
'two': pd.Series([1,2,3,4],index=['a','b','c','z']),
'three':{'a':1}}
df = pd.DataFrame(d)
df
Out[43]:
In [44]:
pd.DataFrame(d, index=['d', 'b', 'a'])
Out[44]:
In [45]:
pd.DataFrame(d, index=['d', 'b', 'a'],
columns=['two', 'three','four'])
Out[45]:
The row and column labels can be accessed respectively by accessing the index and columns attributes:
In [46]:
df.index
Out[46]:
In [47]:
df.columns
Out[47]:
From dict of ndarrays / lists
The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.
In [50]:
d = {'one' : [1., 2., 3.,4.], 'two' : [4., 3., 2., 1.]}
pd.DataFrame(d)
Out[50]:
In [52]:
pd.DataFrame(d) + pd.DataFrame(d, index = ['a','b','c','d'])
Out[52]:
From structured or record array
This case is handled identically to a dict of arrays.
In [53]:
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
print(data)
data[:] = [(1,2.,'Hello'), (2,3.,"World")]
print(data)
df = pd.DataFrame(data)
print(df)
df.dtypes
Out[53]:
In [ ]:
pd.DataFrame(data, index=['first', 'second'])
In [ ]:
pd.DataFrame(data, columns=['C', 'A', 'B'])
From a list of dicts
In [54]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2)
Out[54]:
In [ ]:
pd.DataFrame(data2, index=['first', 'second'])
In [ ]:
pd.DataFrame(data2, columns=['a', 'b'])
From a dict of tuples
You can automatically create a multi-indexed DataFrame by passing a dictionary of tuples.
In [55]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
Out[55]:
(from Pandas Docs)
You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:
In [58]:
import pandas as pd
df = pd.DataFrame({"one":[1.6,2.2,3.4,3.5],"two":[1.5,2.1,3.9,np.nan],"three":[1.2,2.80,3.80,np.nan]})
print(df)
df['four'] = df['one'] * df['two']
df['flag'] = df['one'] > 2
df
Out[58]:
In [57]:
type(df['one'])
Out[57]:
Columns can be deleted or popped like with a dict:
In [ ]:
del df['two']
three = df.pop('three')
df
When inserting a scalar value, it will naturally be propagated to fill the column:
In [59]:
df['foo'] = 'bar'
df
Out[59]:
When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:
In [60]:
df['one_trunc'] = df['one'][:2]
df
Out[60]:
In [61]:
df['one'][:2]
Out[61]:
You can insert raw ndarrays but their length must match the length of the DataFrame’s index.
By default, columns get inserted at the end. The insert function is available to insert at a particular location in the columns:
In [ ]:
df.insert(1, 'rand', np.random.randint(1,10,df["one"].size))
df
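The ndarray case mentioned above can be sketched like this (the column names are made up for the example): an array whose length matches the DataFrame's index is accepted as a new column, while a mismatched length raises a ValueError.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0]})
df['arr'] = np.array([10, 20, 30])  # same length as the index: OK
print(df)
try:
    df['bad'] = np.array([1, 2])    # wrong length: rejected
except ValueError as e:
    print("Error handled:", e)
```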
The basics of indexing are as follows:

Operation                        Syntax          Result
Select column                    df[col]         Series
Select row by label              df.loc[label]   Series
Select row by integer location   df.iloc[loc]    Series
Slice rows                       df[5:10]        DataFrame
Select rows by boolean vector    df[bool_vec]    DataFrame
Row selection, for example, returns a Series whose index is the columns of the DataFrame:
In [62]:
df
Out[62]:
In [63]:
print("> Select column")
print(df["one"])
print("> Select row by index label")
print(df.loc[1])
print("> Select row by integer location")
print(df.iloc[1])
print("> Slice rows")
print(df[2:4])
print("> Select rows by boolean vector")
print(df[[True,True,True,False]])
Both the loc and iloc methods can be used to access cells or groups of cells. When several rows and columns are selected, the result is a DataFrame; a single row or column yields a Series.
In [64]:
print("> Slice the whole DF")
print(df.loc[:,:])
print("> Slice one row")
print(df.loc[1,:])
print("> Slice one column")
print(df.loc[:,"one"])
print("> Slice two columns")
print(df.loc[:,["one","rand"]])
print("> Slice two columns and two rows")
print(df.loc[[1,2],["one","rand"]])
print("> Slice the whole DF")
print(df.iloc[:,:])
print("> Slice one row")
print(df.iloc[1,:])
print("> Slice one column")
print(df.iloc[:,1])
print("> Slice two columns")
print(df.iloc[:,[1,2]])
print("> Slice two columns and two rows")
print(df.iloc[[1,2],[1,2]])
We will go through these examples using a real dataset. Remember from the Series examples that we loaded a csv using the csv
library; however, Pandas provides the necessary tools to read a csv file and output a DataFrame. Let's see an example using OpenFlights data. The data is structured as follows:
In [ ]:
import pandas as pd
import urllib.request
url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat'
response = urllib.request.urlopen(url)
head = ["Airport ID", "Name", "City", "Country", "IATA/FAA", "ICAO", "Latitude", "Longitude",
"Altitude", "Timezone", "DST", "Tz database time zone"]
data_frame = pd.read_csv(response,names=head)
data_frame.head()
In [ ]:
data_frame["Name"].head()
In [ ]:
(data_frame["Name"] + data_frame["ICAO"]).head()
In [ ]:
data_frame["Altitude (m)"] = (data_frame["Altitude"] * 0.3048)
data_frame["Seaside"] = data_frame["Altitude (m)"] < 20
data_frame[data_frame["Seaside"]].head()
In [ ]:
# Columns can be deleted or popped like with a dict:
del data_frame["Altitude"]
seaside = data_frame.pop('Seaside')
print(seaside.head())
data_frame.head()
In [ ]:
# When inserting a scalar value, it will naturally be propagated to fill the column:
data_frame["Infrastructure"] = "Airport"
Assigning New Columns in Method Chains
Inspired by dplyr's mutate verb, DataFrame has an assign() method that allows you to easily create new columns that are potentially derived from existing columns.
In [ ]:
(data_frame.assign(hemisphere_north = lambda x: x['Latitude'].astype(float) > 0)).head(10)
Exercise 1: Load iqsize.csv using the csv library. The result should be a DataFrame.
Exercise 2: Identify the dataset variables. Resolve type problems. Can you change the types?
Exercise 3: Check the range of the quantitative variables. Are they correct? If not, how would you correct them (don't spend too much time)?
Exercise 4: Check the labels of the qualitative variables. Are they correct? If not, how would you correct them?
Exercise 5: For the quantitative variables, compute the mean and median.
Exercise 6: For the qualitative variables, count how many observations of each label exist.
Exercise 7: Compute the statistics of Exercises 5 and 6, but now for each label of the Sex variable.